Subject specific search engine

ABSTRACT

A subject-specific search engine utilizes a smart web crawler and includes a capability of filtering out sites not relevant to the particular subject. As the smart crawler traverses the Internet, sites are filtered, and only sites found relevant are indexed and stored in a database for later searching. Sites may be filtered an arbitrary number of times for relevance, and such filtering may, for example, comprise automated, lexicon-based filtering; manual filtering, using a human editor; or both.

BACKGROUND OF THE INVENTION

[0001] 1. Field of the Invention

[0002] The present invention is directed to the general field of systems, methods, and computer program products for performing Internet-based searching. In particular, it deals with a search engine tailored to search the Internet and to return results that contain fewer irrelevant results than present search engines return.

[0003] 2. Related Art

[0004] It has been said that the Internet and network communities are what push the economy forward these days, and it is a fact that the Internet contains unprecedented volumes of information on just about any topic. The only problem is finding the truly relevant resources. Search engines are what make the Internet useful, because without these tools the chances of finding relevant resources would be significantly diminished. Thus, while the Internet drives the economy, search engines drive the Internet. This is backed by statistics on Internet usage, which show that users spend more online time at search engines than anywhere else, including portals.

[0005] Yet current search engine technology often leaves one dissatisfied and frustrated, particularly where one would like to find resources on a given subject in a specific context. For example, suppose that a user would like to find information on the Ford Pinto in a legal context (referring to the product liability cases against Ford due to defects in the Ford Pinto model's design). A general purpose search engine (GPSE) will typically return numerous irrelevant links if one searches on the term “Pinto,” simply because a GPSE cannot recognize a context or a specific subject, e.g., a legal context or law as a subject. This is because GPSEs adopt the strategy of “everything is relevant”; they therefore try to collect and index all pages on the Internet. Their operations are based on this unedited collection of pages.

[0006] To gain more insight into the workings of GPSEs, it is first worth noting that the term “search engine” is typically used to cover several different types of search facilities. In particular, “search engines” may be broken up into four main categories: robots/crawlers; metacrawlers; catalogs with search facilities; and catalogs or link collections.

[0007] FIG. 1A illustrates the operation of robots/crawlers. These are characterized by having a process (i.e., a crawler) that traverses the Internet 1, as indicated by arrow 4, on a site-by-site basis and sends back to its host machine 2 the contents of each home page it encounters at various sites 3 on its way, as indicated by arrows 5. Then, as shown in FIG. 1B, the host machine 2 indexes the pages 8 sent back by crawler 7 and files the information in its database 9. Any front-end query looks up the search terms in the information stored in the host's database 9. Existing crawlers generally consider all information to be relevant, and therefore, all home pages on all sites traversed are indexed. Examples of such robots/crawlers include Google™, Altavista™, and Hotbot™.

[0008] Metacrawlers, as illustrated in FIG. 2, are characterized in that they offer the possibility of searching in a single search facility 2 and obtaining replies from multiple search facilities 10. The metacrawler serves as a front end to several other facilities 10 and does not have its own “back end.” Metacrawlers are limited by the quality of the information in the search facilities that they employ. Examples of such metacrawlers include MetaCrawler™, LawCrawler™, and LawRunner™.

[0009] Catalogs, with or without search facilities, are characterized in that they are collections of links structured and organized by hand. In the case of a catalog with a search indexed depends on the particular GPSE. A user can enter a query into the front-end, and the GPSE will search the indexed pages. This procedure is based on the principle of “everything is relevant,” meaning that the crawler will get and save every page it encounters. Similarly, every page saved in memory by the crawler will be indexed. This typical operation of a GPSE is illustrated in FIGS. 1A and 1B, as discussed above (indexing part not shown).

SUMMARY OF THE INVENTION

[0010] The present engine takes the form of a subject specific search engine (SSSE), where the strategy adopted is to collect and index only the pages deemed relevant for a specific subject, e.g., law or medicine. The way this is done, in one embodiment of the invention, is through a lexicographic analysis of the texts used by the profession or area of interest. The inventive technology is able to differentiate among contexts and thus to provide a given profession with a search engine that returns only links to relevant resources, i.e., resources the contents of which contain a query term in the desired context. Drawing from the “Pinto” example discussed above, a search for “Pinto” in a legal search engine according to the present invention will thus only return results where the query term “Pinto” appears in a legal context. Put another way, it will return only legal documents or legally relevant documents containing the term “Pinto.”

[0011] To further understand the advantages of an SSSE according to the invention over GPSEs, consider the following scenario:

[0012] Imagine a public library that keeps all of its books in a huge pile. Suppose an attorney needs to find some information about the product liability case brought against Ford for the design faults in the Pinto model.

[0013] Now imagine that the attorney goes to the public library. In this public library all the books are placed in the back, and in order to retrieve any books one must approach a librarian and tell her which word one is looking for. In this case the attorney is looking for “Pinto.” In less than a second the librarian is back and places 460,000 books in front of the attorney. Depending on the public library, these books may not be ordered at all, or they may be ordered by the number of times the word “Pinto” appears in each book or by other people's references to each book.

[0014] To find a book covering the Pinto case, suppose that the attorney starts looking at the title of each book, and if it seems interesting, he reads the back cover. It will not be long before he finds himself reading about Pinto horses, families having Pinto as a surname, the El Pinto restaurant, etc. Once in a while he will find a book that actually is about the Ford Pinto case. If he has the patience and the time, he will find the type of book he is looking for somewhere along the line. If he is able to scan through the books at a rate of one book per second, he will be finished in a little over five days. The end result may be some 500 books.

[0015] To avoid this, the attorney currently has two choices:

[0016] He can use Boolean algebra, if he is familiar with it, by changing the query to something like ‘“Ford Pinto” AND (“product liability” OR “punitive damages”).’ To ensure that he gets all the relevant books, he should also enter all kinds of legal terms (the U.S. legal terminology consists of approximately 20,000 terms).

[0017] He can find the “legal librarian” (or, in Internet terms, a metacrawler, like LawRunner.com or LawCrawler.com). The legal librarian does some of the work that the attorney must do in the preceding option; that is, the librarian makes sure that both the original word, “Pinto,” and either the word “legal” or the word “law” are in the books that the librarian returns. It might, however, seem a bit inadequate to get only two terms out of the 20,000 terms mentioned above (i.e., without entering the rest manually).

[0018] The attorney may, however, have a third option, a specialized library (for example, a university department's library, like a law school library or an engineering school library). A specialized library is a library specializing in one subject; in the present example, the appropriate library would be a law library. If the attorney were to ask the librarian here, the “Pinto” query would result in, perhaps, 500 books. The key here is that, before any book is placed in the specialized library, it has been classified as relevant in the library's particular context. That is, someone actually sat down, looked through the book, and decided that it contained relevant material. As a result, in the present example, all the books about Pinto horses, families, and the like, never make it into this library, thus eliminating the hassle of ignoring them.

[0019] The inventive SSSE draws upon some of the concepts of this third option. In particular, the inventive SSSE provides a particular profession (or more generally, special interest group) with a search engine that returns only links to resources that contain components of the profession's terminology.

[0020] The inventive SSSE starts with the principle that not all pages or even sites are relevant. If one is building an SSSE for United States law, pages from sites like www.games.com and www.mp3.com are generally not relevant. A human would “know” that a site with the name www.games.com is not likely to contain pages with relevant content for the legal profession. The question then is how to make a computer system “know.”

[0021] In an SSSE according to an embodiment of the present invention, a first feature is that the crawler may perform filtering and indexing, in addition to merely finding information. This means that the crawler is now “aware” of the analysis of each web page and can act accordingly.

[0022] A second feature of an SSSE according to an embodiment of the present invention is the addition of a new field in the database containing the information stored by the crawler.

[0023] This field holds a parameter referred to as the “depth.” The depth is the number of preceding pages that were traversed and were deemed not relevant.

[0024] A third feature of an SSSE according to an embodiment of the present invention is the setting of a threshold for how deep the crawler will be permitted to crawl down a branch before it is stopped, that is, how many irrelevant pages in a row will be allowed before the branch may be considered entirely irrelevant.

[0025] In one embodiment of the invention, the crawler is designed so as to filter each site it traverses using a database of relevant terminology. In another embodiment of the invention, the information is sent to the host, and all analyzing processes are left to the host computer running the crawler. The web page corresponding to each site that is passed through the filter and deemed preliminarily relevant may then be filtered one or more additional times. Filtering may be performed either automatically or in conjunction with a human. In the case of automatic filtering, the additional filtering may be performed either as part of the crawler or as a process running on a host computer. Pages that are passed through as many filtering stages as are present and are deemed relevant are then indexed and stored in a database.

[0026] To provide users with ease in retrieving the most relevant information, an embodiment of the invention utilizes a ranking system for determining which pages are most relevant. The ranking system is based on the computation and storage of word rankings and the computation of site (page) rankings, based on the word rankings, in response to user queries. Rankings are then used to display the sites retrieved in the search in accordance with their rankings, so as to give display priority to the most relevant sites.

[0027] Also for the sake of user-friendliness, an embodiment of the invention utilizes a hierarchical display system. For example, all pages linked to from a given page may be displayed indented under the main page's URL. Such a display may be implemented in collapsible/expandable form. As discussed above, display may take into account site rankings.

[0028] The invention may be embodied in the form of a method, system, and computer program product (i.e., on a computer-readable medium).

[0029] Definitions of Terms

[0030] In describing the invention, the following definitions are applicable throughout (including above).

[0031] A “computer” refers to any apparatus that is capable of accepting a structured input, processing the structured input according to prescribed rules, and producing results of the processing as output. Examples of a computer include a computer; a general-purpose computer; a supercomputer; a mainframe; a super mini-computer; a mini-computer; a workstation; a microcomputer; a server; an interactive television; a hybrid combination of a computer and an interactive television; and application-specific hardware to emulate a computer and/or software. A computer can have a single processor or multiple processors, which can operate in parallel and/or not in parallel. A computer also refers to two or more computers connected together via a network for transmitting or receiving information between the computers. An example of such a computer includes a distributed computer system for processing information via computers linked by a network.

[0032] A “computer-readable medium” refers to any storage device used for storing data accessible by a computer. Examples of a computer-readable medium include a magnetic hard disk; a floppy disk; an optical disk, like a CD-ROM or a DVD; a magnetic tape; a memory chip; and a carrier wave used to carry computer-readable electronic data, such as those used in transmitting and receiving e-mail or in accessing a network. “Memory” refers to any medium used for storing data accessible by a computer. Examples include all the examples listed above under the definition of “computer-readable medium.” “Software” refers to prescribed rules to operate a computer. Examples of software include software; code segments; instructions; computer programs; and programmed logic.

[0033] A “computer system” refers to a system having a computer, where the computer comprises a computer-readable medium embodying software to operate the computer.

[0034] A “network” refers to a number of computers and associated devices that are connected by communication facilities. A network involves permanent connections such as cables or temporary connections such as those made through telephone or other communication links, or both. Examples of a network include an internet, such as the Internet; an intranet; a local area network (LAN); a wide area network (WAN); and a combination of networks, such as an internet and an intranet.

BRIEF DESCRIPTION OF THE DRAWINGS

[0035] Embodiments of the invention will now be described with reference to the attached drawings in which:

[0036] FIGS. 1A and 1B together illustrate the operation of a typical prior-art GPSE, and FIG. 1A also partially illustrates the operation of a crawler according to an embodiment of the present invention;

[0037] FIG. 2 illustrates the operation of a typical prior-art metacrawler;

[0038] FIGS. 3A and 3B illustrate, along with FIG. 1A, the operation of an embodiment of an SSSE according to the present invention;

[0039] FIG. 4 illustrates a configuration according to an embodiment of the present invention;

[0040] FIGS. 5A and 5B illustrate a depth monitoring process according to an embodiment of the invention;

[0041] FIGS. 6A, 6B, and 6C illustrate various embodiments of filtering operations according to the present invention;

[0042] FIG. 7 illustrates an exemplary process used in implementing steps of the embodiments shown in FIGS. 6A, 6B, and 6C; and

[0043] FIGS. 8A and 8B depict exemplary display formats according to embodiments of the invention.

DESCRIPTION OF EMBODIMENTS OF THE INVENTION

[0044] The general structure of an embodiment of an SSSE according to the invention is shown in FIG. 4. As shown, there are three primary components in the SSSE: smart crawler 16, host computer 17, and human interface 18. Smart crawler 16 operates, for the most part, as shown in FIG. 1A (that is, similar to prior-art crawler programs); however, there are additional features that differentiate the inventive smart crawler from the prior-art crawlers discussed above. As is the case with typical GPSEs, as discussed above, the crawler, in this case smart crawler 16, transmits information back to host computer 17; this is similar to host machine 2 in FIG. 1A, but it may also perform additional processes. Finally, human interface 18 is provided for entering search queries and, in some embodiments, for human interaction in the processes of information screening and indexing. The roles of these components will become clearer in view of the discussion below explaining the operation of the inventive SSSE.

[0045] As explained above, in one embodiment smart crawler 16 operates in basically the same way as prior-art crawlers, i.e., by visiting sites and transmitting information back to host computer 17. However, unlike prior-art crawlers, in this embodiment smart crawler 16 does not operate under the “everything is relevant” principle; rather, it operates as shown in FIG. 3A. In FIG. 3A, smart crawler 16 traverses the Internet 1 and then performs a screening operation, denoted by site filter 11. Site filter 11 determines, based on terminology of the profession to which the SSSE is directed (e.g., law), whether or not each site is considered to be relevant to the profession. The result is that some sites are filtered out, leaving, essentially, an Internet 1′ containing only relevant sites. It is only the information on relevant sites that is transmitted to host computer 17 in this embodiment. At host computer 17, the information on relevant sites, as determined by site filter 11, may be stored in memory (not shown) for further processing, or it may be indexed and stored in a database. Filter 11 may be implemented either in automated form or in a form requiring human interaction.

[0046] In another embodiment the filtering capabilities may be implemented solely in the host computer 17. In this case, smart crawler 16 returns all site information to host computer 17 for screening, and host computer 17 makes all determinations as to whether or not sites are relevant and as to whether or not links to sites should be traversed. As in the case of the previous embodiment, filtering may be automated or manual (e.g., human editing).

[0047] FIG. 3B reflects further steps that may be carried out in some embodiments of the process carried out by the inventive SSSE. In such embodiments, there is at least one additional level of filtering 13, which may be carried out either as part of smart crawler 16 or as a separate process in host computer 17. As shown, the information (web page) 12 of each site found relevant in the process shown in FIG. 3A may be screened at least a second time, by filter 13, which again screens the terminology found in the site information (in the case of a manual implementation, a human may also be able to account for additional site information, like the name of the site and the “overall feel” of the site). Only information 14 that passes through this second filter 13 is then indexed 15 and stored in database 9.

[0048] Therefore, in general, each site stored in the SSSE passes through one or more layers of screening, each of which may be implemented in automated or manual form. In one exemplary embodiment, two automated filters are followed by human screening prior to indexing.

[0049] As discussed above, filter 13 acts to filter out irrelevant web pages (and similarly with any additional filters, if present). The pages that are filtered out are discarded, and no links to such pages are traversed. Thereafter, if the smart crawler encounters a link to a discarded page, it simply ignores it.

[0050] The strategy of using both a smart crawler having automated filtering and human editing in some embodiments of the invention combines the best of two worlds: the speed of the machine and the reasoning of man. The machine suggests a number of sites, the editor approves or discards the sites, and the machine indexes the relevant pages on the approved sites. Based on links from the approved sites, the smart crawler may suggest more sites, etc., resulting in an evolution of the search engine.

[0051] A further difference between an embodiment of the inventive smart crawler and prior-art crawlers lies in the use of a “depth monitoring system” in connection with determining whether or not sites should even be visited. FIGS. 5A and 5B will be used to describe such a depth monitoring system, according to an embodiment of the invention.

[0052] FIG. 5A depicts a “chunk” of the Internet and will be used as an aid in explaining FIG. 5B. FIG. 5A depicts a hierarchy of three levels: i, j, and k. Each level has at least one site. Additionally, links between sites will be referred to below as “link_(xy),” where “xy” designates the fact that the link is from site x to site y. Note that the Internet, when viewed on a larger scale, is generally not a hierarchy; however, on a small scale, as depicted, it can be viewed as such. In any event, the inventive system is applicable to the Internet, in general.

[0053] FIG. 5B gives a flowchart demonstrating the operation of the inventive depth monitoring system. Assume that site i has already been visited and that a link from site i to a site in the j-th level, say, j1, has been traversed (i.e., link_(ij1) has been traversed). When a site is initially visited, each of its links to further sites is assigned a depth equal to that of the link that was traversed to reach the site, i.e., D_(link_jk) = D_(link_ij) for each outgoing link jk (for this example, links j1k1 and j1k2); this is shown in step S1. In step S2, filter 11 makes a determination as to whether or not the site visited (in this example, j1) is relevant. If so, the process goes to step S7, where the depths of all outgoing links from the site are reset to zero (i.e., in the ongoing example, step S7 would set D_(link_j1k1) = 0 and D_(link_j1k2) = 0). The process then traverses a link to the next level S8 (e.g., from j1 to k1); in so doing, the current site (e.g., j1) will next be considered to be the previous site (that is, i becomes j1), and the next site (e.g., k1) will be considered to be the current site (that is, j becomes k1). From here, the process returns to step S1. If step S2 determines that the current site (e.g., j1) is not relevant, then step S3 increments D_(link_jk) for all outgoing links (in the example, j1k1 and j1k2). The process then proceeds to step S4, where it is determined whether or not D_(link_jk) for the outgoing links from the site exceeds a predetermined maximum value, D_(max). If D_(link_jk) > D_(max), then no sites stemming from that site are visited, and the links from that site are deleted S5; that is, the “branch” ends at that site. If this were the case, then the next site at the same level of the hierarchy (here, j2) would be visited (if there were no such site, then the process would go back to the previous level to determine if there were another site to be visited from there, etc.) S6.
In general, depth is monitored for each link traversed, until it is determined that at least one link from the original site leads to at least one relevant site (i.e., within a depth of no more than D_(max); if this never happens, then all links from the site are deleted).

[0054] To understand this process more fully, consider the following additional example, where D_(max) is assumed to be two:

[0055] 1. A page A contains a link to the page B. Page A is deemed relevant, so the link to B has depth 0.

[0056] 2. B has a link to C. B is deemed not relevant, so the link to C is assigned the depth 1.

[0057] 3. C has a link to D. C is deemed not relevant, so the link to page D is assigned the depth 2.

[0058] 4. D has a link to E. D is deemed relevant, so the link to page E is assigned the depth 0.

[0059] 5. Note that if D had been deemed not relevant, the link to page E would have been assigned the depth 3, which is greater than D_(max). In this case, the link from D to E would have been deleted, and it would have been determined if there were another site to be visited from C.
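By way of non-limiting illustration, the depth monitoring of FIG. 5B and the five-step example above may be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention: the function name crawl, the dictionary links (standing in for a portion of the Internet), and the predicate is_relevant (standing in for filter 11) are hypothetical. The sketch also visits pages breadth-first and visits each page only once, whereas the flowchart describes a branch-by-branch traversal.

```python
from collections import deque

def crawl(start, links, is_relevant, d_max):
    """Traverse `links` from `start`, ending a branch once more than
    `d_max` consecutive non-relevant pages have been seen.

    Returns (list of relevant pages, dict mapping (src, dst) -> link depth).
    All names here are illustrative assumptions, not taken from the source.
    """
    relevant = []
    link_depth = {}
    seen = {start}
    queue = deque([(start, 0)])  # (page, depth of the link used to reach it)
    while queue:
        page, incoming = queue.popleft()
        if is_relevant(page):
            relevant.append(page)
            depth = 0                  # cf. step S7: reset depth to zero
        else:
            depth = incoming + 1       # cf. step S3: increment depth
            if depth > d_max:          # cf. step S4: compare with D_(max)
                continue               # cf. step S5: drop the outgoing links
        for target in links.get(page, []):
            link_depth[(page, target)] = depth
            if target not in seen:
                seen.add(target)
                queue.append((target, depth))
    return relevant, link_depth

# The A -> B -> C -> D -> E chain from the example, with D_(max) = 2.
chain = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["E"], "E": []}
pages, depths = crawl("A", chain, lambda p: p in {"A", "D"}, d_max=2)
```

With A and D deemed relevant, the links A-B, B-C, C-D, and D-E receive depths 0, 1, 2, and 0, matching steps 1 through 4. If only A is deemed relevant, the link from D to E is never recorded and E is never visited, matching step 5.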

[0060] In a more concrete example, suppose there is a link to www.games.com and that the SSSE is geared toward the legal context. It is most likely that www.games.com would be deemed not relevant in a legal context, so all the links from www.games.com to other pages, both on www.games.com and other sites, would have the depth 1. Suppose further that from www.games.com, the smart crawler follows the link to www.games.com/Review_The_Ultimate_Car_Game.html, which has a link to www.joysticks.com, from which there are further links. The link from www.games.com to www.games.com/Review_The_Ultimate_Car_Game.html will be given a depth of 1, and, if the review page is likewise deemed not relevant, the link from this page to www.joysticks.com will be given a depth of 2. If the maximum depth is set to 2, and if the page www.joysticks.com is deemed not relevant, the links from www.joysticks.com are discarded.

[0061] The embodiment of the invention discussed above makes use of at least one automated filter. Exemplary embodiments of automated filtering are depicted in FIGS. 6A-6C. FIG. 6A shows the basic idea of the exemplary embodiment of automated filtering according to the invention. A web page is input to the filter, and there is an optional step S11 of removing extraneous material, that is, of recognizing and eliminating from consideration things like advertisements. The main part of the filter takes the page and compares it with a lexicon of terms (e.g., legal terms) S12 whose presence will indicate that the page may be relevant. If the comparison is favorable (this will be discussed further below) S13, then the page is saved S15. If not, then the page is discarded S14.

[0062] FIG. 6B shows a second exemplary embodiment of automated filtering. In this embodiment, prior to any analysis, the page under consideration is broken up into component parts (cells) S16. This serves the purpose of making it easier to discriminate between material that needs to be tested and material that is extraneous S11. It also permits a piecemeal approach to testing. The components are passed into a test stage, where first it is determined if there are any remaining components that need to be tested S17. If yes, then the next component is compared with the lexicon S12′, and the process loops back to S17. If not, then all components of the page have been tested, and the question is asked as to whether or not there was at least one relevant component on the page S18. If not, then the page is discarded S14. If yes, then the page is saved S15.

[0063] FIG. 6C depicts a fully component-oriented exemplary embodiment of automated filtering. As in FIG. 6B, the web page is broken up into its constituent components S16, and extraneous components may be removed S11. The process then determines if there is still a component of the page left to test S17. If not, the process ends S19. Otherwise, the next component is compared with the lexicon S12′. The process then determines if the comparison results are favorable S20. If yes, then the component is saved S22; if not, then it is discarded S21. In this manner, the database that is built by the SSSE need only perform queries on relevant portions of pages, rather than on entire pages that may include irrelevant material.
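The difference between the page-level decision of FIG. 6B and the component-level decision of FIG. 6C may be illustrated with the following short Python sketch. It is offered only as an assumption-laden example: the three-term set LEGAL_TERMS, the simple test cell_is_relevant, and both function names are hypothetical stand-ins for the lexicon comparison, which the source describes separately.

```python
# Hypothetical mini-lexicon; a real lexicon would be far larger and weighted.
LEGAL_TERMS = {"liability", "damages", "plaintiff"}

def cell_is_relevant(cell):
    """Crude stand-in for the lexicon comparison: any lexicon word present."""
    words = {w.strip(".,;:!?\"'()").lower() for w in cell.split()}
    return bool(words & LEGAL_TERMS)

def page_passes(cells, is_relevant=cell_is_relevant):
    """FIG. 6B style: save the whole page if at least one cell is relevant."""
    return any(is_relevant(c) for c in cells)

def keep_relevant_cells(cells, is_relevant=cell_is_relevant):
    """FIG. 6C style: save only the relevant cells, discard the rest."""
    return [c for c in cells if is_relevant(c)]

cells = [
    "Site menu: Home | Cars | Contact",
    "The plaintiff sought punitive damages.",
    "Pinto horses have spotted coats.",
]
saved = keep_relevant_cells(cells)
```

In the FIG. 6B style the page above is saved in its entirety, whereas in the FIG. 6C style only the second cell is saved, so that later queries need not touch the menu or the text about horses.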

[0064] The above embodiments all include steps of comparing with a lexicon and determining whether or not the comparison was favorable. FIG. 7 depicts an exemplary embodiment of how this may be done. For each object (page or component) to be tested, each word, term, or expression in the object is compared with the words, terms, and expressions found in the lexicon S20. Within the lexicon, different words, terms, and expressions may be assigned different weights, for example, according to relative significance. If it is determined that a word, term, or expression in the object matches one in the lexicon, the weight assigned to the word, term, or expression is added to a cumulative total weight for the object S21. Once the entire object has been tested in this fashion, the cumulative total weight is compared to a predetermined threshold value S22. The value of the threshold may be set according to how selective the SSSE designer wants the database to be. If the cumulative total weight exceeds the threshold, the object is deemed relevant and is saved S24. If not, then the object is deemed irrelevant and is discarded S23.
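The weighted comparison of FIG. 7 may be sketched, purely for illustration, as follows. The lexicon entries, their weights, the threshold, and the function names score_object and is_relevant are all assumptions introduced for this example. Matching multi-word expressions by sliding a window of up to three words over the text is likewise only one possible choice.

```python
import re

def score_object(text, lexicon, max_words=3):
    """Sum the weights of every lexicon word or expression found in `text`.

    `lexicon` maps lower-case terms (one to `max_words` words) to weights.
    Each occurrence of a term contributes its weight once.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0.0
    for n in range(1, max_words + 1):
        for i in range(len(tokens) - n + 1):
            total += lexicon.get(" ".join(tokens[i:i + n]), 0.0)
    return total

def is_relevant(text, lexicon, threshold):
    """Deem the object relevant if its cumulative weight exceeds the threshold."""
    return score_object(text, lexicon) > threshold

# Hypothetical legal lexicon with illustrative weights.
legal_lexicon = {"court": 1.0, "liability": 2.0, "punitive damages": 3.0}
sample = "The court awarded punitive damages in the product liability case."
total = score_object(sample, legal_lexicon)
```

Raising the threshold makes the resulting database more selective, as the text above notes. Here the sample sentence scores 6.0, so it passes a threshold of 4.0, while a sentence about Pinto horses scores 0.0 and is discarded.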

[0065] Also in two of the above embodiments is the step of breaking up a web page into components S16. In an exemplary embodiment, this may be done by splitting up each web page into cells, where a cell is a portion of the page. This is done by analyzing the HTML code for the page. In one embodiment, cells may correspond to paragraphs of text; however, they may correspond to any desired components of the web page (e.g., lines of text or different portions of a page having multiple areas of text). In addition to the advantages of breaking up a web page into its components S16 discussed above, this also makes it easier to remove extraneous material, like menus, banners, etc., leaving only the cells containing material that might contain relevant text.
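One way of splitting a page into cells by analyzing its HTML is sketched below using Python's standard html.parser module. The choice of which tags delimit a cell (CELL_TAGS) and which tags mark extraneous material (SKIP_TAGS), as well as the names CellSplitter and split_into_cells, are assumptions made for this example and are not specified by the source.

```python
from html.parser import HTMLParser

class CellSplitter(HTMLParser):
    """Collect the text of a page as a list of cells (hypothetical helper)."""
    CELL_TAGS = {"p", "td", "li", "div"}      # assumed cell boundaries
    SKIP_TAGS = {"script", "style", "nav"}    # assumed extraneous material

    def __init__(self):
        super().__init__()
        self.cells = []
        self._buf = []
        self._skip = 0

    def flush(self):
        # Normalize whitespace and store the buffered text as one cell.
        text = " ".join("".join(self._buf).split())
        if text:
            self.cells.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip += 1
        elif tag in self.CELL_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in self.CELL_TAGS:
            self.flush()

    def handle_data(self, data):
        if not self._skip:
            self._buf.append(data)

def split_into_cells(html):
    parser = CellSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep any trailing text outside a cell tag
    return parser.cells

page = ("<html><body><nav>Home | About</nav><p>Ford was held liable.</p>"
        "<p>Pinto  horses\nare spotted.</p><script>var x=1;</script></body></html>")
cells = split_into_cells(page)
```

In this sketch the navigation menu and the script are dropped as extraneous material, and each paragraph becomes one cell that can then be compared with the lexicon individually.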

[0066] One particular advantage to using a lexicon-based filter is that all of the components of the filter may be the same for any context/profession, except for the lexicon. Therefore, one need only change the lexicon accessed by the other components in order to create a search engine for a different context/profession. This may be done within the host computer 17 (in FIG. 4) by referencing a different memory for each context/profession. This may, in turn, be done by referencing a different file in a memory (for example, on a hard drive of the host computer) or by replacing a replaceable memory component (for example, a floppy disk or a CD-ROM).

[0067] In one embodiment of the invention, an inventive site ranking feature is also included. This ranking system analyzes the Internet (i.e., the sites found) to determine the degree to which sites have been found interesting by others in the desired context/profession. In particular, in one embodiment, this is determined by finding the number of links and citations to sites from other relevant sites determined by a user query. This information may be used in conjunction with displaying the results of the query, in order to emphasize the sites most likely to be helpful.

[0068] In a further embodiment of the invention, the site ranking feature is implemented using a word ranking scheme. The basic idea of this technique is to assign numerical scores to words and to sum the scores of the words on a page to determine a score for the page. The technique works by examining each non-trivial word on a given page (that is, excluding “stop words” like “if,” “it,” “and,” and the like) and increasing its score if it appears on a relevant page (i.e., a page that passed through filtering) containing a link to the given page. In a sub-embodiment, the word score is increased according to how many relevant pages that link to the given page contain the word. In a further embodiment, the technique is augmented by increasing a word's score according to where it appears in a linking page relative to the link leading to the page being examined. In particular, if the word appears closer in proximity to the link to the page being examined, its score is increased.
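The sub-embodiment in which a word's score grows with the number of relevant linking pages that contain it may be sketched as follows. The function name word_scores_for, its parameters, and the choice of adding exactly one point per linking page are assumptions made for illustration. The proximity-based augmentation is not modeled.

```python
def word_scores_for(page_words, linking_pages_words, stop_words):
    """Score each non-stop word on a page by the number of relevant
    linking pages that also contain it.

    page_words:          words appearing on the page being scored
    linking_pages_words: one set of words per relevant page linking to it
    stop_words:          set of trivial words to ignore
    """
    scores = {}
    for word in set(page_words) - set(stop_words):
        scores[word] = sum(1 for words in linking_pages_words if word in words)
    return scores

# Hypothetical data: two relevant pages link to the page being scored.
scores = word_scores_for(
    ["the", "pinto", "liability", "case"],
    [{"pinto", "ford"}, {"pinto", "liability"}],
    {"the"},
)
```

In this example “pinto” appears on both linking pages and receives a score of 2, “liability” receives 1, “case” receives 0, and the stop word “the” is not scored at all.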

[0069] A word score is saved for each word on each page (i.e., except for stop words, as discussed above). When a user enters a query, the inventive SSSE determines a set of (relevant) pages that contain the query terms. For each page, the word scores are summed for the words of the query to compute a site ranking for that page. The site rankings for the pages are then used in determining how to display the search results to the user. In summary, the inventive system utilizes dynamic site rankings, computed based on word rankings and in response to user queries.
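The query-time computation of site rankings from stored word scores may be illustrated with the sketch below. The structure of word_scores (a dictionary per page), the example URLs, and the requirement that a page contain every query term are assumptions made for this example. The source states only that the pages contain the query terms.

```python
def rank_pages(query, word_scores):
    """Return (page, site ranking) pairs, highest ranking first.

    word_scores maps each page to a dict of {word: stored word score}.
    A page qualifies only if it has a stored score for every query term.
    """
    terms = query.lower().split()
    if not terms:
        return []
    ranked = []
    for page, scores in word_scores.items():
        if all(term in scores for term in terms):
            ranked.append((page, sum(scores[term] for term in terms)))
    # Sort by descending ranking; break ties by page name for stable output.
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked

# Hypothetical stored word scores for three indexed pages.
stored = {
    "a.example/pinto-case": {"pinto": 5, "liability": 3},
    "b.example/horses": {"pinto": 1, "horses": 4},
    "c.example/torts": {"liability": 2},
}
results = rank_pages("Pinto", stored)
```

Because the rankings are computed from the query terms, the same page may rank differently for different queries. In the data above, the query “Pinto” ranks the case page (5) ahead of the horse page (1), while the query “pinto liability” returns only the case page, with a ranking of 8.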

[0070] A further feature according to an embodiment of the invention is a user-friendly display of results. In a preferred embodiment, this user-friendly display is a hierarchical type of display. In a further embodiment, the display uses the site ranking feature to determine an order in which to display the results. That is, the most relevant sites, as determined by their rankings, would be displayed earlier in the display and/or more prominently than lower-ranking sites.

[0071] FIGS. 8A and 8B show two exemplary embodiments of a display according to the present invention. FIG. 8A shows a display in the form of a file-document type of hierarchy. A file 20, 22 represents a type of pages/sites that it contains. As shown, the type may contain additional sub-types (shown as files). The type also contains documents, which represent the actual pages/sites. One traverses the hierarchy by clicking on files 20, 22 to open them until one locates a desired document 21. One then clicks on the document 21 to access the information or site.

[0072] Similarly, FIG. 8B shows a display in menu form. In the depiction of FIG. 8B, there are six “site types” that represent six different classes of information found during a search of the SSSE database. As in conventional menu-based systems, an arrow in the menu indicates another level of menu. In the example shown, a user has clicked on Site Type F to reveal three sites (F1, F2, and F3). The user may then access any particular one of these sites by clicking on the appropriate menu item.

[0073] In a further embodiment of the invention, the “files” or “site types” in one level of the hierarchy may consist of URLs of sites, and the next level of the hierarchy may then contain “files”/“site types” and “documents”/“sites” linked to from those URLs.
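
The hierarchical display described above can be modeled with a small tree. This is only a sketch under stated assumptions: the `SiteType` and `Document` classes, the text rendering, and the choice to list documents before sub-types are hypothetical and are not taken from FIGS. 8A and 8B.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """A leaf of the hierarchy: an actual page/site with its site ranking."""
    title: str
    ranking: float


@dataclass
class SiteType:
    """A "file" / "site type": holds documents and nested sub-types."""
    name: str
    subtypes: list = field(default_factory=list)
    documents: list = field(default_factory=list)


def render(node, indent=0):
    """Return display lines, with documents ordered by descending ranking."""
    lines = [" " * indent + "[" + node.name + "]"]
    ordered = sorted(node.documents, key=lambda d: d.ranking, reverse=True)
    for doc in ordered:
        lines.append(" " * (indent + 2) + doc.title)
    for sub in node.subtypes:
        lines.extend(render(sub, indent + 2))
    return lines


if __name__ == "__main__":
    root = SiteType(
        "Site Type F",
        documents=[Document("F2", 1.0), Document("F1", 3.5)],
        subtypes=[SiteType("Cases", documents=[Document("F3", 2.0)])],
    )
    print("\n".join(render(root)))
```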

[0074] Note that, while the invention has been described above in the context of the Internet, it may be similarly applied to any other computer network.

[0075] The inventive procedure is based on the principle that “most pages are not relevant” and that the inventive SSSE should separate the “wheat” from the “chaff.” This permits the inventive system not to visit every page on the Internet, because it can quickly determine that a site is not relevant and, as a result, none of the pages on that site are indexed. One of the consequences of this is that highly irrelevant pages, like most “free home pages,” are discarded. Another consequence is that the inventive system builds a fairly large database of relevant material very rapidly.

[0076] The invention has been described in detail with respect to preferred embodiments, and it will now be apparent from the foregoing to those skilled in the art that changes and modifications may be made without departing from the invention in its broader aspects. The invention, therefore, as defined in the appended claims, is intended to cover all such changes and modifications as fall within the true spirit of the invention.

We claim:
 1. A method of compiling and accessing subject-specific information from a computer network, the method comprising the steps of: traversing links between sites on the computer network; filtering the contents of each site visited to determine relevancy of content; and presenting information on each site deemed relevant for indexing.
 2. The method according to claim 1, further comprising the step of: filtering the contents of a site at least a second time for relevancy, prior to the step of presenting.
 3. The method according to claim 2, wherein at least one of said filtering steps comprises the steps of: presenting the contents to a human editor; approving, by the human editor, if the contents are deemed relevant; and disapproving, by the human editor, if the contents are not deemed relevant.
 4. The method according to claim 2, wherein at least one of said filtering steps comprises the step of: passing the contents of the site through a lexicon-based filter, the filter comparing contents of the site with terminology found in the lexicon.
 5. The method according to claim 4, wherein the step of passing the contents of the site through a lexicon-based filter comprises the steps of: breaking up a web page corresponding to the site contents into component parts; and comparing the contents of each component part with the lexicon.
 6. The method according to claim 5, wherein the step of passing the contents of the site through a lexicon-based filter further comprises the steps of: assigning a weight to each component part based on a result of the step of comparing; and deeming the component part to be relevant if it achieves a high-enough weight.
 7. The method according to claim 6, wherein the step of assigning a weight comprises the steps of: assigning a weight to each word, term, or expression in the component part that matches a word, term, or expression in the lexicon, according to a weight associated with the word, term, or expression; and accumulating a sum of assigned weights, the sum forming the weight assigned to the component part.
 8. The method according to claim 6, wherein the step of passing the contents of the site through a lexicon-based filter further comprises the steps of: saving component parts deemed to be relevant and passing them to the presenting step; and discarding component parts deemed not to be relevant.
 9. The method according to claim 6, wherein the step of passing the contents of the site through a lexicon-based filter further comprises the steps of: if at least one component part is deemed to be relevant, passing the web page to the presenting step; and if no component part is deemed to be relevant, discarding the web page.
 10. The method according to claim 4, wherein the step of passing the contents of the site through a lexicon-based filter comprises the step of: comparing the contents of a web page corresponding to the site with the lexicon.
 11. The method according to claim 10, wherein the step of passing the contents of the site through a lexicon-based filter further comprises the steps of: assigning a weight to the web page based on a result of the step of comparing; and deeming the web page to be relevant if it achieves a high-enough weight.
 12. The method according to claim 11, wherein the step of assigning a weight comprises the steps of: assigning a weight to each word, term, or expression in the web page that matches a word, term, or expression in the lexicon, according to a weight associated with the word, term, or expression; and accumulating a sum of assigned weights, the sum forming the weight assigned to the web page.
 13. The method according to claim 11, wherein the step of deeming comprises the steps of: saving the web page and passing it to the step of presenting if it achieves a high-enough weight; and discarding the web page if it does not achieve a high-enough weight.
 14. The method according to claim 1, wherein the step of filtering the contents comprises the step of: passing the contents of the site through a lexicon-based filter, the filter comparing contents of the site with terminology found in the lexicon.
 15. The method according to claim 14, wherein the step of passing the contents of the site through a lexicon-based filter comprises the steps of: breaking up a web page corresponding to the site contents into component parts; and comparing the contents of each component part with the lexicon.
 16. The method according to claim 15, wherein the step of passing the contents of the site through a lexicon-based filter further comprises the steps of: assigning a weight to each component part based on a result of the step of comparing; and deeming the component part to be relevant if it achieves a high-enough weight.
 17. The method according to claim 16, wherein the step of assigning a weight comprises the steps of: assigning a weight to each word, term, or expression in the component part that matches a word, term, or expression in the lexicon, according to a weight associated with the word, term, or expression; and accumulating a sum of assigned weights, the sum forming the weight assigned to the component part.
 18. The method according to claim 16, wherein the step of passing the contents of the site through a lexicon-based filter further comprises the steps of: saving component parts deemed to be relevant and passing them to the presenting step; and discarding component parts deemed not to be relevant.
 19. The method according to claim 16, wherein the step of passing the contents of the site through a lexicon-based filter further comprises the steps of: if at least one component part is deemed to be relevant, passing the web page to the presenting step; and if no component part is deemed to be relevant, discarding the web page.
 20. The method according to claim 14, wherein the step of passing the contents of the site through a lexicon-based filter comprises the step of: comparing the contents of a web page corresponding to the site with the lexicon.
 21. The method according to claim 20, wherein the step of passing the contents of the site through a lexicon-based filter further comprises the steps of: assigning a weight to the web page based on a result of the step of comparing; and deeming the web page to be relevant if it achieves a high-enough weight.
 22. The method according to claim 21, wherein the step of assigning a weight comprises the steps of: assigning a weight to each word, term, or expression in the web page that matches a word, term, or expression in the lexicon, according to a weight associated with the word, term, or expression; and accumulating a sum of assigned weights, the sum forming the weight assigned to the web page.
 23. The method according to claim 21, wherein the step of deeming comprises the steps of: saving the web page and passing it to the step of presenting if it achieves a high-enough weight; and discarding the web page if it does not achieve a high-enough weight.
 24. The method according to claim 14, further comprising the step of: filtering the contents of a site at least a second time for relevancy, prior to the step of presenting.
 25. The method according to claim 24, wherein the step of filtering the contents at least a second time comprises the steps of: presenting the contents to a human editor; approving, by the human editor, if the contents are deemed relevant; and disapproving, by the human editor, if the contents are not deemed relevant.
 26. The method according to claim 14, further comprising the step of: replacing the lexicon with a lexicon corresponding to a different subject in order to create a different subject-specific database.
 27. The method according to claim 1, further comprising the step of: compiling a database of searchable relevant information.
 28. The method according to claim 27, further comprising the steps of: permitting a user to enter a query; and searching the database for information according to the query.
 29. The method according to claim 28, further comprising the step of: displaying information found in said step of searching in a hierarchical format.
 30. The method according to claim 28, further comprising the step of: determining a site ranking for each site associated with information found in said searching step, where the determining is according to how interesting at least one of authors and users of the computer network have found the site associated with the information.
 31. The method according to claim 30, further comprising the step of: displaying the results of the user query using the site ranking of each item of information found in the search to determine an order in which the results are displayed.
 32. The method according to claim 31, wherein the step of displaying the results of the user query comprises the step of: displaying the results of the user query in a hierarchical format according to site ranking.
 33. The method according to claim 27, wherein the step of compiling a database comprises the step of: for each relevant site to be stored in the database, assigning a word score to each word appearing on that site.
 34. The method according to claim 33, wherein the step of assigning word scores comprises the steps of: determining all sites found in the database that contain links to the site; for each word on the site, assigning a word score for that word based at least in part on its presence on each site containing a link to the site.
 35. The method according to claim 34, wherein the step of assigning a word score for that word further comprises the step of increasing the word score for each site containing a link to the site if the word appears in close proximity to the link.
 36. The method according to claim 33, wherein the step of assigning word scores comprises the steps of: determining all sites found in the database that contain links to the site; and assigning a word score to each word on the site based at least in part on how many sites linking to the site also contain the particular word.
 37. The method according to claim 36, wherein the step of assigning a word score for that word further comprises the step of increasing the word score for each site containing a link to the site according to the proximity of the word to the link.
 38. The method according to claim 33, further comprising the steps of: entering a user query; using the user query to search the database; and computing a site ranking for each site associated with information found in said searching step, the site ranking being computed based on said word scores.
 39. The method according to claim 38, wherein the step of computing a site ranking comprises the steps of: for each site associated with information found in said searching step, summing the word scores for that site corresponding to words in the user query.
 40. A computer-readable medium containing software implementing the method as claimed in claim 1.
 41. A system for compiling and accessing information from a computer network, the system comprising: a processor; and a computer-readable medium as claimed in claim 40.
 42. The method according to claim 1, further comprising the step of: monitoring a depth for each link, the depth being a reflection of relevance.
 43. The method according to claim 42, wherein the step of monitoring comprises the steps of: for a given site being visited, setting depths of any links leading from that site to other sites to a depth of a link traversed to reach the given site; if the given site is determined to be relevant in the filtering step, setting the depths of the links leading from that site to zero; and if the given site is determined not to be relevant in the filtering step, incrementing the depths of the links leading from that site.
 44. The method according to claim 43, wherein the step of monitoring further comprises the steps of: comparing the incremented depths to a predetermined maximum depth value; if the incremented depths exceed the predetermined maximum depth value, discarding the links leading from the given site; if the incremented depths do not exceed the predetermined maximum depth value, traversing one of the links leading from the given site.
 45. The method according to claim 1, wherein said filtering step comprises the steps of: presenting the contents to a human editor; approving, by the human editor, if the contents are deemed relevant; and disapproving, by the human editor, if the contents are not deemed relevant.
 46. A system that compiles and permits accessing of subject-specific information from a computer network, the system comprising: a host computer executing software from a computer-readable medium, the software comprising: a smart crawler for traversing the computer network; a first filter, filtering out irrelevant sites, and permitting only relevant sites to pass; and an indexer indexing the relevant sites; and memory, connected to the host computer, for storing indexed subject-specific information.
 47. The system according to claim 46, wherein said first filter comprises a lexicon-based filter.
 48. The system according to claim 47, wherein the system further comprises an interchangeable computer-readable medium on which is stored the lexicon for the lexicon-based filter, the lexicon containing subject-specific terminology.
 49. The system according to claim 46, wherein the software further comprises at least a second filter.
 50. The system according to claim 49, wherein the system further comprises a human-computer interface, and wherein at least one of said first filter and said at least a second filter comprises: a presentation of relevant site information received from the smart crawler to a human editor via the human-computer interface; and means for receiving input from the human editor, entered via the human-computer interface, as to whether or not to index and store the site in the memory.
 51. The system according to claim 49, wherein at least one of said first filter and said at least a second filter comprises a lexicon-based filter.
 52. The system according to claim 51, wherein the system further comprises an interchangeable computer-readable medium on which is stored the lexicon for the lexicon-based filter, the lexicon containing subject-specific terminology.
 53. The system according to claim 46, wherein the system further comprises a human-computer interface, and wherein said first filter comprises: a presentation of relevant site information received from the smart crawler to a human editor via the human-computer interface; and means for receiving input from the human editor, entered via the human-computer interface, as to whether or not to index and store the site in the memory.
 54. A method of ranking the relevance of information stored in a database, the information comprising web pages, the method comprising the steps of: computing and storing a word ranking for each word, except for stop words, found on each web page; and in response to a user query, computing a site ranking for each web page found in response to the user query based on the word rankings.
 55. The method according to claim 54, wherein the step of computing a word ranking is performed according to how interesting at least one of authors and users of a computer network in which each web page is resident have found the web page.
 56. The method according to claim 54, wherein the step of computing a word ranking comprises the step of: for each word, except stop words, on each web page, determining all web pages found in the database that contain links to the web page on which the word appears; and assigning a word score for that word based at least in part on its presence on each web page containing a link to the web page on which that word appears, the word score constituting the word ranking for that word.
 57. The method according to claim 56, wherein the step of assigning a word score for that word further comprises the step of increasing the word score for each web page containing a link to the web page on which that word appears if the word appears in close proximity to the link.
 58. The method according to claim 54, wherein the step of computing a site ranking comprises the steps of: for each web page found in response to the user query, summing the word rankings for that web page corresponding to words in the user query.
 59. A computer-readable medium containing software implementing the method of claim 54.